
Web Crawls

The Internet Archive's Web Archive, started in late 1996, is made available through the Wayback Machine, and some collections are available in bulk to researchers.

Other than the pages collected by the Internet Archive, major contributors include Alexa Internet, Cuil, and those listed below.

1,559,121 results


Internet Archive Web Crawls
600,217 items, 2.5B views
Crawl data collected by the Internet Archive. This data is currently not publicly accessible in this format. To view archived web pages, please visit the Wayback Machine.
Topic: webwidecrawl
Alexa Crawls
115,927 items, 1.3B views
Crawl data donated by Alexa Internet. This data is currently not publicly accessible. Decryption Keys are kept in an item. Alexa is the leading provider of free, global web metrics. Search Alexa to discover the most successful sites on the web by keyword, category, or country.
Topic: webcrawl
Wide Crawls
280,392 items, 996.6M views
Wide crawls of the Internet conducted by Internet Archive. Access to content is restricted. Please visit the Wayback Machine to explore archived web sites.
Live Web Proxy Crawls
8,723 items, 651.5M views
Content crawled via the Wayback Machine Live Proxy. The liveweb proxy is a component of the Internet Archive’s Wayback Machine project. It captures the content of a web page in real time, archives it into an ARC or WARC file, and returns the ARC/WARC record to the Wayback Machine for processing. The recorded ARC/WARC file becomes part of the Wayback Machine in due course.
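To make the capture-and-wrap step more concrete, here is a minimal Python sketch of wrapping already-captured HTTP response bytes in a single WARC/1.0 "response" record. It is not the liveweb proxy's code: the function name `build_warc_response_record` and the example payload are hypothetical, no network fetch is performed, and ARC output and gzip compression are left out.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response_record(target_uri: str, http_payload: bytes) -> bytes:
    """Wrap raw HTTP response bytes in one WARC/1.0 'response' record.

    Hypothetical helper for illustration only.
    """
    warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {warc_date}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_payload)}",
    ]
    # Header lines are CRLF-separated and closed by an empty line.
    header_block = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # A record is the header block, the payload, then two CRLFs.
    return header_block + http_payload + b"\r\n\r\n"


# Stand-in for bytes a proxy would have captured from the live web.
example_payload = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><body>hello</body></html>"
)
record = build_warc_response_record("http://example.com/", example_payload)
print(record.decode("utf-8"))
```

Appending such records to a file one after another yields an uncompressed WARC file; real tooling typically also gzips each record and writes a leading `warcinfo` record.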
Survey Crawls
39,195 items, 316.4M views
Survey crawls of domains. This data is currently not publicly accessible.
web-group-internal
28,889 items, 231.3M views
miscellaneous data
Topic: brad tofel
Custom Crawl Services
36,172 items, 197.9M views
by Internet Archive
National library harvesting.
Topic: ccs
Wayback Indexes
554 items, 196.8M views
Wayback indexes. This data is currently not publicly accessible.
Wide Crawl started April 2013
24,878 items, 185.8M views
Web wide crawl with initial seedlist and crawler configuration from April 2013.
Survey Crawl April 2013
16,198 items, 178.9M views
Survey crawl of domains started April 2013. This data is currently not publicly accessible.
Archive-It Digital Collection
115,355 items, 178.2M views
The Archive-It Digital Collection
Topic: data archive
Focused Crawls
83,524 items, 177.7M views
by Internet Archive
Focused crawls are collections of frequently-updated webcrawl data from narrow (as opposed to broad or wide) web crawls, often focused on a single domain or subdomain.
Topic: webcrawl
alexa_2007
7,635 items, 156.1M views
This data is currently not publicly accessible.
Wide Crawl started January 2012
28,389 items, 118M views
Web wide crawl with initial seedlist and crawler configuration from January 2012 using HQ software.
alexa_2006
6,505 items, 116.8M views
This data is currently not publicly accessible.
Wide Crawl started April 2012
38,825 items, 112.9M views
Web wide crawl with initial seedlist and crawler configuration from April 2012.
Wide Crawl started August 2013
21,701 items, 109.7M views
Web wide crawl with initial seedlist and crawler configuration from August 2013.
Top Domains
43,625 items, 107.2M views
A collection of deep web crawls of the most popular domains according to Alexa.com's rankings.
Around The World Crawl
2,147 items, 91.7M views
Data crawled by Sloan Foundation on behalf of Internet Archive.
Wiki Collections
342,636 items, 84.1M views
Collections of Wiki data
Topics: crawls, data, wiki
Wikipedia Outlinks
5,347 items, 83.4M views
Crawl of outlinks from wikipedia.org. These files are currently not publicly accessible. From Wikipedia: Wikipedia is a multilingual, web-based, free-content encyclopedia project operated by the Wikimedia Foundation and based on an openly editable model. The name "Wikipedia" is a portmanteau of the words wiki (a technology for creating collaborative websites, from the Hawaiian word wiki, meaning "quick") and encyclopedia. Wikipedia's articles provide links to guide the user...
Wide Crawl started October 2010
15,223 items, 80.5M views
Web wide crawl with initial seedlist and crawler configuration from October 2010.
Wide Crawl Started January 2013
14,975 items, 79.7M views
Wide crawls of the Internet conducted by Internet Archive. Access to content is restricted. Please visit the Wayback Machine to explore archived web sites.
79M views
Web wide crawl with initial seedlist and crawler configuration from September 2012.
Fix Broken Links Web Crawls
15,505 items, 76M views
These crawls are part of an effort to archive pages as they are created, along with the pages that they refer to. That way, as the referenced pages are changed or taken off the web, a link to the version that was live when the page was written will be preserved. The Internet Archive hopes that references to these archived pages will then be put in place of a link that would otherwise be broken, or a companion link to allow people to see what was originally intended by a page's...
survey_com00000
2,534 items, 75.5M views
Survey crawl of .com domains started January 2011.
Topic: webcrawl
Wide Crawl started October 2011
11,873 items, 68.2M views
Web wide crawl with initial seedlist and crawler configuration from March 2011 using HQ software.
Wide Crawl started June 2014
45,310 items, 62.6M views
Web wide crawl with initial seedlist and crawler configuration from June 2014.
Wide Crawl started March 2011
8,178 items, 61.5M views
Web wide crawl with initial seedlist and crawler configuration from March 2011. This uses the new HQ software for distributed crawling by Kenji Nagahashi.
What’s in the data set:
Crawl start date: 09 March, 2011
Crawl end date: 23 December, 2011
Number of captures: 2,713,676,341
Number of unique URLs: 2,273,840,159
Number of hosts: 29,032,069
The seed list for this crawl was a list of Alexa’s top 1 million web sites, retrieved close to the crawl start date. We used Heritrix...
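The figures quoted for this crawl allow a quick back-of-the-envelope check of how often URLs were revisited and how many URLs each host contributed on average. The Python sketch below only does arithmetic on the three numbers in the description; the variable names are illustrative and not part of any Internet Archive tooling.

```python
# Figures quoted in the March 2011 wide crawl description.
captures = 2_713_676_341      # number of captures
unique_urls = 2_273_840_159   # number of unique URLs
hosts = 29_032_069            # number of hosts

# Average number of captures per distinct URL.
captures_per_url = captures / unique_urls

# Average number of distinct URLs per host.
urls_per_host = unique_urls / hosts

# Captures that revisit a URL already captured at least once.
repeat_captures = captures - unique_urls

print(f"captures per unique URL: {captures_per_url:.2f}")
print(f"unique URLs per host:    {urls_per_host:.1f}")
print(f"repeat captures:         {repeat_captures:,}")
```

On these numbers most URLs were captured only once, and the per-host average is just under 80 URLs. Averages like these hide a skewed distribution, so they say little about any individual host.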
38_crawl
1,387 items, 58.1M views
This data is currently not publicly accessible.
Top News
34,245 items, 55.3M views
A collection of deep web crawls of the world's top news sites, curated from a variety of sources.
Alexa Crawl EG
1,671 items, 55M views
Crawl EG from Alexa Internet. This data is currently not publicly accessible.
web_iq
2,616 items, 49.5M views
Crawl performed by Internet Archive. This data is currently not publicly accessible.
45.1M views
Crawls performed by Internet Archive on behalf of the National Library of Australia. This data is currently not publicly accessible.
Wikipedia Outlinks February 2012
2,860 items, 44.9M views
Crawl of outlinks from wikipedia.org started February, 2012. These files are currently not publicly accessible.
alexa_web_2009
3,079 items, 43.7M views
This data is currently not publicly accessible.
alexa_web_2010
2,993 items, 42.9M views
This data is currently not publicly accessible.
web_wk
9,820 items, 42.5M views
Crawl performed by Internet Archive. This data is currently not publicly accessible.
38.8M views
This is a collection of web pages from Wikipedia and the pages that those Wikipedia pages link to. Collection is triggered when a page is created or changed. The idea is to make Wikipedia outlinks reliable, so that if the pages referenced by a Wikipedia article are changed or go away, a reader can still find what was originally referred to, permanently. This is part of the Internet Archive's attempt to rid the web of broken links. As of October 2013, there was some...
26_crawl
1,466 items, 35.6M views
This data is currently not publicly accessible.
National Library of Spain
6,684 items, 35.5M views
Data collected by Internet Archive on behalf of the National Library of Spain. This data is currently not publicly accessible.
34.7M views
Crawls of the French domain space performed by Internet Archive on behalf of the Bibliothèque nationale de France. This data is currently not publicly accessible.
Alexa Crawls DF
248 items, 33.6M views
Crawl data donated by Alexa Internet. This data is currently not publicly accessible.
51_crawl
1,138 items, 32.1M views
This data is currently not publicly accessible.
31.2M views
This is a collection of pages and embedded objects from WordPress blogs and the external pages that they link to. It relies on a feed of new and changed pages sent from WordPress to the Internet Archive. With a WordPress plug-in that checks for broken links and "link rot" and redirects them to the Wayback Machine, the Internet Archive hopes that those who publish with WordPress can offer their readers more reliable links.
Wide Crawl started February 2014
9,584 items, 30.9M views
Web wide crawl with initial seedlist and crawler configuration from February 2014.
35_crawl
1,179 items, 28.8M views
This data is currently not publicly accessible.
alexa_1999
244 items, 28.2M views
This data is currently not publicly accessible.
Alexa Crawls DO
492 items, 28.2M views
Crawl data donated by Alexa Internet. This data is currently not publicly accessible.
Shallow Crawls
1,039 items, 27.5M views
Shallow crawls that collect content 1 level deep including embeds. This data is currently not publicly accessible.
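One reading of "1 level deep including embeds" is a breadth-first crawl that follows links at most one hop from each seed while always collecting the resources a visited page embeds. The Python sketch below illustrates that reading on a made-up in-memory link graph; `TOY_WEB`, `shallow_crawl`, and the example URLs are hypothetical, and this is not the crawler configuration actually used for these collections.

```python
from collections import deque

# Hypothetical in-memory web: each URL maps to the pages it links to
# and the resources it embeds (images, stylesheets, scripts).
TOY_WEB = {
    "http://a.example/": {
        "links": ["http://a.example/1", "http://b.example/"],
        "embeds": ["http://a.example/logo.png"],
    },
    "http://a.example/1": {
        "links": ["http://a.example/2"],
        "embeds": ["http://a.example/style.css"],
    },
    "http://b.example/": {"links": [], "embeds": []},
    "http://a.example/2": {"links": [], "embeds": ["http://a.example/deep.png"]},
}


def shallow_crawl(seeds, web, max_depth=1):
    """Return URLs reached within max_depth link hops, plus their embeds."""
    seen = set(seeds)
    order = list(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        url, depth = queue.popleft()
        page = web.get(url, {"links": [], "embeds": []})
        # Embeds of a visited page are always collected; they are not a hop.
        for embed in page["embeds"]:
            if embed not in seen:
                seen.add(embed)
                order.append(embed)
        # Links are followed only while under the depth limit.
        if depth < max_depth:
            for link in page["links"]:
                if link not in seen:
                    seen.add(link)
                    order.append(link)
                    queue.append((link, depth + 1))
    return order


collected = shallow_crawl(["http://a.example/"], TOY_WEB)
for url in collected:
    print(url)
```

With the toy graph, the seed, its two linked pages, and the embeds of those three pages are collected, while the page two hops away and its image are not.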
52_crawl
2,589 items, 27.3M views
This data is currently not publicly accessible.
web_el_2008
1,678 items, 26.9M views
This data is currently not publicly accessible.
Alexa Crawls DY
1,325 items, 24.2M views
Crawl data donated by Alexa Internet. This data is currently not publicly accessible.
Alexa Crawl EI
1,403 items, 24.2M views
Crawl EI from Alexa Internet. This data is currently not publicly accessible.
20th Century Web
331 items, 23.7M views
Collection of web items from the 20th century.
Topics: web, 20th century
International News Crawls
3,043 items, 23.7M views
Crawls of International News Sites
Alexa Crawls EA
1,312 items, 23.6M views
Crawl data donated by Alexa Internet. This data is currently not publicly accessible.
Topic: crawldata
Alexa Crawl DX
1,440 items, 23.5M views
Crawl DX from Alexa Internet. This data is currently not publicly accessible.
web_tran
4,120 items, 23.3M views
Crawl performed by Internet Archive. This data is currently not publicly accessible.
Survey Crawl May 2014
6,909 items, 23M views
Survey crawl of domains started May 2014. This data is currently not publicly accessible.
web_mon
3,750 items, 22.9M views
Crawl performed by Internet Archive. This data is currently not publicly accessible.
Green Crawl
148 items, 22.7M views
Crawl data donated by Alexa Internet. This data is currently not publicly accessible.
29_crawl
1,565 items, 22.4M views
This data is currently not publicly accessible.
Alexa Crawl DZ
1,204 items, 22.4M views
Crawl DZ from Alexa Internet. This data is currently not publicly accessible.
Wikipedia Outlinks May 2011
1,544 items, 22.3M views
Crawl of outlinks from wikipedia.org started May, 2011. These files are currently not publicly accessible.
Alexa Crawls DU
945 items, 21.6M views
Crawl data donated by Alexa Internet. This data is currently not publicly accessible.
alexa_ed
1,181 items, 21.6M views
This data is currently not publicly accessible.
web_el
925 items, 21.2M views
Crawl performed by Internet Archive. This data is currently not publicly accessible.
Cuil Crawl Data
26,257 items, 20.8M views
Web crawl snapshot generously donated by cuil.com. This collection of pages, mostly from 2007 with some from 2008, is about 310 terabytes of compressed data and almost 60 billion URLs (mostly text). Cuil was a search engine that organized web pages by content and displayed relatively long entries along with thumbnail pictures for many results. Cuil said it had a larger index than any other search engine, with about 120 billion web pages. It went live on July 28, 2008. Cuil's servers were shut...
Alexa Crawl EH
1,216 items, 19.6M views
Crawl EH from Alexa Internet. This data is currently not publicly accessible.
Alexa Crawls DR
755 items, 19.6M views
Crawl data donated by Alexa Internet. This data is currently not publicly accessible.
Alexa Crawl EB
650 items, 19.3M views
Crawl EB from Alexa Internet. This data is currently not publicly accessible.
Alexa Crawls DQ
886 items, 18.9M views
Crawl data donated by Alexa Internet. This data is currently not publicly accessible.
alexa_dw
957 items, 18.7M views
This data is currently not publicly accessible.
Alexa Crawl DL
413 items, 18.5M views
Crawl DL from Alexa Internet. This data is currently not publicly accessible.